欢迎关注“小丫画图”公众号,回复“小白”,看小视频,实现点鼠标跑代码。
小丫微信: epigenomics E-mail: figureya@126.com
作者:Caelum,他的更多作品看这里https://k.youshop10.com/M4k338pp
小丫编辑校验
画出CNV的这个图。
出自https://molecular-cancer.biomedcentral.com/articles/10.1186/s12943-021-01322-w,跟FigureYa259circLink和FigureYa262GDC出自同一篇文章
Fig. 1 Genetic and transcriptional alterations of RNA modification “writers” in CRC. c Bar graphs showing the frequency of CNV gain (red), loss (blue) and non_CNV (green) of RNA modification “writers” in the TCGA-COAD/READ cohort. The height of each bar represents the alteration frequency.
外显子测序数据可以画这样的图。
RNA-seq数据也可以借助这样的分析来深挖。例如RNA-seq筛出成百上千个差异基因,谁才是关键基因?它通过什么方式影响了下游基因的表达?或许是某些基因发生了高频copy number variation(gain/loss)。
使用国内镜像安装包
options("repos"= c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
options(BioC_mirror="http://mirrors.tuna.tsinghua.edu.cn/bioconductor/")
加载包
library(magrittr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::extract() masks magrittr::extract()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::set_names() masks magrittr::set_names()
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:tidyr':
##
## expand
## The following object is masked from 'package:base':
##
## expand.grid
##
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## The following object is masked from 'package:purrr':
##
## reduce
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
##
## select
##
library(TCGAbiolinks)
library(clusterProfiler)
##
## clusterProfiler v3.16.1 For help: https://guangchuangyu.github.io/software/clusterProfiler
##
## If you use clusterProfiler in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
##
## Attaching package: 'clusterProfiler'
## The following object is masked from 'package:AnnotationDbi':
##
## select
## The following object is masked from 'package:IRanges':
##
## slice
## The following object is masked from 'package:S4Vectors':
##
## rename
## The following object is masked from 'package:purrr':
##
## simplify
## The following object is masked from 'package:stats':
##
## filter
Sys.setenv(LANGUAGE = "en") #显示英文报错信息
options(stringsAsFactors = FALSE) #禁止chr转成factor
从GDC下载hg38的CNV数据,以COAD/READ为例。
easy_input_gene.txt,要看哪些基因的CNV,就写到这个文件里。甚至可以把所有差异基因或参与某个通路的基因都拿来计算,查看输出文件output_cnv.csv。然后筛选gain或loss频率高的前十几个基因,进行画图展示。
cnv_COAD <- GDCquery(project = "TCGA-COAD",
data.category = "Copy Number Variation",
data.type = "Gene Level Copy Number Scores") %T>%
GDCdownload %>% GDCprepare
## --------------------------------------
## o GDCquery: Searching in GDC database
## --------------------------------------
## Genome of reference: hg38
## --------------------------------------------
## oo Accessing GDC. This might take a while...
## --------------------------------------------
## ooo Project: TCGA-COAD
## --------------------
## oo Filtering results
## --------------------
## ooo By data.type
## ----------------
## oo Checking data
## ----------------
## ooo Check if there are duplicated cases
## ooo Check if there results for the query
## -------------------
## o Preparing output
## -------------------
## Downloading data for project TCGA-COAD
## Of the 1 files for download 1 already exist.
## All samples have been already downloaded
## Reading GISTIC file
## Reading file: GDCdata/TCGA-COAD/harmonized/Copy_Number_Variation/Gene_Level_Copy_Number_Scores/7f01e47a-2f4c-4b91-8db6-e1b6b5595390/COAD.focal_score_by_genes.txt
##
indexing COAD.focal_score_by_genes.txt [=------------------] 34.31GB/s, eta: 0s
indexing COAD.focal_score_by_genes.txt [==================] 208.03MB/s, eta: 0s
dim(cnv_COAD)
## [1] 19729 509
cnv_READ <- GDCquery(project = "TCGA-READ",
data.category = "Copy Number Variation",
data.type = "Gene Level Copy Number Scores") %T>%
GDCdownload %>% GDCprepare
## --------------------------------------
## o GDCquery: Searching in GDC database
## --------------------------------------
## Genome of reference: hg38
## --------------------------------------------
## oo Accessing GDC. This might take a while...
## --------------------------------------------
## ooo Project: TCGA-READ
## --------------------
## oo Filtering results
## --------------------
## ooo By data.type
## ----------------
## oo Checking data
## ----------------
## ooo Check if there are duplicated cases
## ooo Check if there results for the query
## -------------------
## o Preparing output
## -------------------
## Downloading data for project TCGA-READ
## Of the 1 files for download 1 already exist.
## All samples have been already downloaded
## Reading GISTIC file
## Reading file: GDCdata/TCGA-READ/harmonized/Copy_Number_Variation/Gene_Level_Copy_Number_Scores/8dcb2bfd-0e8a-402d-941c-def06f2e8a96/READ.focal_score_by_genes.txt
##
indexing READ.focal_score_by_genes.txt [=====--------------] 92.45GB/s, eta: 0s
indexing READ.focal_score_by_genes.txt [==================] 282.75MB/s, eta: 0s
dim(cnv_READ)
## [1] 19729 173
# 合并COAD和READ
cnv <- cbind(cnv_COAD, cnv_READ)
dim(cnv)
## [1] 19729 682
# 删除重复的sample
cnv <- cnv[,!duplicated(colnames(cnv))]
dim(cnv)
## [1] 19729 679
# 根据需要定义一个基因集
# 可以从文件读入
genes <- read.table("easy_input_gene.txt", header = T)$SYMBOL
# 也可以直接在这里写入
#genes <- c("METTL3", "METTL14", "FTO", "ALKBH5","IGF2BP3",
# "YTHDF1", "YTHDF2", "YTHDF3", "IGF2BP1", "IGF2BP2", )
cnv2 <- cnv %>%
# 理解ENSEMBL命名方法,提取前15位作为基因名
mutate(ENSEMBL = str_sub(`Gene Symbol`, 1, 15)) %>%
# 转换为基因名SYMBOL
inner_join(bitr(.$ENSEMBL,
fromType = "ENSEMBL",
toType = "SYMBOL",
OrgDb = org.Hs.eg.db)) %>%
# 去除转换时可能产生的重复项
distinct(SYMBOL, .keep_all = TRUE) %>%
# 仅保留目的基因
filter(SYMBOL %in% genes) %>%
# 将SYMBOL作为行名,去除非数字列干扰
column_to_rownames("SYMBOL") %>%
# 去除其他干扰信息
dplyr::select(-`Gene ID`, -Cytoband, -ENSEMBL, -`Gene Symbol`) %>%
# 理解TCGA barcode,仅保留肿瘤样本数据
select_if(str_sub(colnames(.), 14, 15) < 10) %>%
# +1代表gain,-1代表loss,统计每一行的loss和gain
mutate(CNV_loss = apply(., 1, function(x) sum(x == -1)/length(x)),
CNV_gain = apply(., 1, function(x) sum(x == 1)/length(x))) %>%
# 仅保留loss和gain数据
dplyr::select(CNV_loss, CNV_gain) %>%
# 重新把行名当做基因名
rownames_to_column("gene") %>%
# 统计没有CNV的样本的比例
mutate(none_CNV = 1 - CNV_loss - CNV_gain) %>%
# 将宽数据转变为长数据,便于绘图
pivot_longer(!gene, names_to = "Group", values_to = "pct") %>%
# 转换为百分比
mutate(pct = pct * 100)
## 'select()' returned 1:many mapping between keys and columns
## Warning in bitr(.$ENSEMBL, fromType = "ENSEMBL", toType = "SYMBOL", OrgDb =
## org.Hs.eg.db): 3.48% of input gene IDs are fail to map...
## Joining, by = "ENSEMBL"
# 保存到文件
write.csv(cnv2, "output_cnv.csv", row.names = F, quote = F)
# 按照easy_input_gene.txt文件中的顺序画基因
#cnv2$gene <- factor(cnv2$gene, levels = genes)
# 或者按照CNV_gain由大到小的顺序画基因
cnv2_gain <- cnv2[cnv2$Group == "CNV_gain",]
cnv2_gain_sorted <- cnv2_gain[order(cnv2_gain$pct, decreasing = T),]
cnv2$gene <- factor(cnv2$gene, levels = cnv2_gain_sorted$gene)
# 按照loss、gain、none的顺序画bar
cnv2$Group <- factor(cnv2$Group, levels = c("CNV_loss", "CNV_gain", "none_CNV"))
ggplot(cnv2, aes(x = gene, y = pct, fill = Group)) +
# 绘制堆积条形图
geom_col(position = "stack") +
# 添加标题
ggtitle(paste0("TCGA-COAD/READ\nn=", ncol(cnv) - 3)) +
# 手动添加颜色
scale_fill_manual(values = c("steelblue3", "firebrick2", "forestgreen")) +
xlab("m6A genes") +
ylab("Frequency of group(%)") +
theme_classic() +
theme(axis.line = element_line(size = 1, lineend = "square"),
axis.title = element_text(size = 15,face = "bold"),
axis.text.y = element_text(size = 12,face = "bold"),
axis.text.x = element_text(size = 12,face = "bold", angle = 90, hjust = 1),
axis.ticks.length = unit(0.25,"cm"),
axis.ticks = element_line(size = 1),
legend.text = element_text(size = 12,face = "bold"),
legend.title = element_text(size = 15,face = "bold"))
ggsave("CNV.pdf", width = 6, height = 4)
我们算出的frequency跟原文差距较大,跟cBioPortal结果相近。跟原文作者联系,还没有回复。
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] clusterProfiler_3.16.1 TCGAbiolinks_2.16.4 org.Hs.eg.db_3.11.4
## [4] AnnotationDbi_1.50.3 IRanges_2.22.2 S4Vectors_0.26.1
## [7] Biobase_2.48.0 BiocGenerics_0.34.0 forcats_0.5.1
## [10] stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
## [13] readr_2.1.1 tidyr_1.1.4 tibble_3.1.6
## [16] ggplot2_3.3.5 tidyverse_1.3.1 magrittr_2.0.1
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.4.1
## [3] fastmatch_1.1-3 BiocFileCache_1.12.1
## [5] plyr_1.8.6 igraph_1.2.10
## [7] splines_4.0.2 BiocParallel_1.22.0
## [9] GenomeInfoDb_1.24.2 urltools_1.7.3
## [11] digest_0.6.29 yulab.utils_0.0.4
## [13] htmltools_0.5.2 GOSemSim_2.14.2
## [15] viridis_0.6.2 GO.db_3.11.4
## [17] fansi_0.5.0 memoise_2.0.1
## [19] tzdb_0.2.0 graphlayouts_0.7.2
## [21] modelr_0.1.8 matrixStats_0.61.0
## [23] vroom_1.5.7 R.utils_2.11.0
## [25] askpass_1.1 enrichplot_1.8.1
## [27] prettyunits_1.1.1 colorspace_2.0-2
## [29] blob_1.2.2 rvest_1.0.2
## [31] rappdirs_0.3.3 ggrepel_0.9.1
## [33] haven_2.4.3 xfun_0.29
## [35] crayon_1.4.2 RCurl_1.98-1.5
## [37] jsonlite_1.7.2 scatterpie_0.1.7
## [39] glue_1.5.1 polyclip_1.10-0
## [41] gtable_0.3.0 zlibbioc_1.34.0
## [43] XVector_0.28.0 DelayedArray_0.14.1
## [45] scales_1.1.1 DOSE_3.14.0
## [47] DBI_1.1.1 Rcpp_1.0.7
## [49] viridisLite_0.4.0 progress_1.2.2
## [51] gridGraphics_0.5-1 bit_4.0.4
## [53] europepmc_0.4.1 httr_1.4.2
## [55] fgsea_1.14.0 RColorBrewer_1.1-2
## [57] ellipsis_0.3.2 pkgconfig_2.0.3
## [59] XML_3.99-0.8 R.methodsS3_1.8.1
## [61] farver_2.1.0 sass_0.4.0
## [63] dbplyr_2.1.1 utf8_1.2.2
## [65] labeling_0.4.2 ggplotify_0.1.0
## [67] tidyselect_1.1.1 rlang_0.4.12
## [69] reshape2_1.4.4 munsell_0.5.0
## [71] cellranger_1.1.0 tools_4.0.2
## [73] cachem_1.0.6 downloader_0.4
## [75] cli_3.1.0 generics_0.1.1
## [77] RSQLite_2.2.9 ggridges_0.5.3
## [79] broom_0.7.10 evaluate_0.14
## [81] fastmap_1.1.0 yaml_2.2.1
## [83] knitr_1.37 bit64_4.0.5
## [85] fs_1.5.2 tidygraph_1.2.0
## [87] ggraph_2.0.5 R.oo_1.24.0
## [89] DO.db_2.9 xml2_1.3.3
## [91] biomaRt_2.44.4 compiler_4.0.2
## [93] rstudioapi_0.13 curl_4.3.2
## [95] reprex_2.0.1 tweenr_1.0.2
## [97] bslib_0.3.1 stringi_1.7.6
## [99] highr_0.9 lattice_0.20-45
## [101] Matrix_1.4-0 vctrs_0.3.8
## [103] pillar_1.6.4 lifecycle_1.0.1
## [105] BiocManager_1.30.16 triebeard_0.3.0
## [107] jquerylib_0.1.4 data.table_1.14.2
## [109] cowplot_1.1.1 bitops_1.0-7
## [111] GenomicRanges_1.40.0 qvalue_2.20.0
## [113] R6_2.5.1 gridExtra_2.3
## [115] MASS_7.3-54 assertthat_0.2.1
## [117] SummarizedExperiment_1.18.2 openssl_1.4.5
## [119] withr_2.4.3 GenomeInfoDbData_1.2.3
## [121] hms_1.1.1 ggfun_0.0.4
## [123] grid_4.0.2 rvcheck_0.2.1
## [125] rmarkdown_2.11 ggforce_0.3.3
## [127] lubridate_1.8.0